KSTAR: An algorithm to predict patient-specific kinase activities from phosphoproteomic data

Kinase inhibitors as targeted therapies have played an important role in improving cancer outcomes. However, there are still considerable challenges, such as resistance, non-response, patient stratification, polypharmacology, and identifying combination therapy where understanding a tumor kinase activity profile could be transformative. Here, we develop a graph- and statistics-based algorithm, called KSTAR, to convert phosphoproteomic measurements of cells and tissues into a kinase activity score that is generalizable and useful for clinical pipelines, requiring no quantification of the phosphorylation sites. In this work, we demonstrate that KSTAR reliably captures expected kinase activity differences across different tissues and stimulation contexts, allows for the direct comparison of samples from independent experiments, and is robust across a wide range of dataset sizes. Finally, we apply KSTAR to clinical breast cancer phosphoproteomic data and find that there is potential for kinase activity inference from KSTAR to complement the current clinical diagnosis of HER2 status in breast cancer patients.

Supplementary Note 1: Assessing the impact of pruning and parameter selection in KSTAR Goal 1. Profile the impact of the heuristic prune on the characteristics of the kinase-substrate network 2. Explore the impact of applying different cutoffs to experimental datasets when determining which phosphorylation sites to use as evidence.

Methods
For comparison to commonly used kinase-substrate networks, kinase-substrate annotations were downloaded from PhosphoSitePlus [1]. The full NetworKIN prediction graph was downloaded for the entire phosphoproteome and edges with weights less than 1 were removed [2]. A threshold of 1 was selected in order to balance the number of edges that are removed from the network and the inaccuracy often observed at low thresholds.
To look at sensitivity of KSTAR results to different evidence cutoffs in phosphoproteomic experiments, we generated kinase activity predictions using different evidence sizes (number of phosphorylation sites identified in original experiment used for prediction) for a tyrosine dataset [3] and a serine/threonine dataset [4]. In both cases, the first test used all of the sites identified in the experiment, and subsequent tests with smaller evidence sizes removed the least abundant sites from analysis to mimic the effect of thresholding.   [1], kinase-substrate predictions from NetworKIN thresholded with a value of 1 (remove edges with edge weights less than 1)(orange) [2], and the KSTAR ensemble of pruned networks (green). A) Number of unique substrates found within each network. B) Fraction of sites identified within an experiment that are also found within the kinase-substrate network. Experiments include phosphoproteomic dataset for predictions found in Figures 2,5, and 6 of the main text. Each point on the plot represents a unique experiment (Tyrosine: n = 11 independent biological experiments, Serine/Threonine: n = 6 independent biological experiments). While prediction algorithms using either PhosphoSitePlus or NetworKIN will often have to drop > 50% of sites identified in an experiment, KSTAR typically loses < 20%. This problem is especially apparent in tyrosine networks. C) Pairwise network similarity between different kinases in each network, defined by the Jaccard similarity, calculated as JI = total number of substrates connected to both kinases total number of substrates of connected to either kinase . Each point represents a different pairwise comparison between kinases in the kinase-  Figure 2. Impact of experiment threshold on tyrosine kinase predictions Using control data of K562 chronic myeloid leukemia cell lines [3], we assessed variance in KSTAR predictions for tyrosine kinases using different thresholds (cutoff value to deterimine whether a phosphorylation site should be included in evidence). The threshold determines the total number of sites used as evidence for prediction. For this test, predictions were based evidence sizes ranging from 10 -1665 (all identified) sites. A) KSTAR predictions for kinases that have predicted activity in at least one test (F P R ≤ 0.05). B) Distribution of KSTAR activity scores for all kinases at each evidence size. Box indicates median (center line), 25th and 75th percentiles (box boundaries), 1.5x the IQR of the box edge (whiskers), and any outliers beyond 1.5x IQR (points). If no outliers exist, whiskers indicate maxima or minima. For each condition, statistics were derived from all tyrosine kinases with predictions in KSTAR (n = 50). C) Histograms demonstrating the distribution of activity scores when 25, 100, 750, 1500 sites are used for prediction. D) Correlation of activity rankings between predictions of different evidence sizes. Source data are provided as a separate Source Data file with this paper.  Figure 3. Impact of experiment threshold on serine/threonine kinase predictions Using control data of BT-474 Breast cancer cell lines [4], we assessed variance in KSTAR predictions for serine/threonine kinases using different thresholds (cutoff value to deterimine whether a phosphorylation site should be included in evidence). The threshold determines the total number of sites used as evidence for prediction. For this test, predictions were based on evidence sizes ranging from 10 -10345 (all identified) sites. A) KSTAR predictions for kinases that have predicted activity in at least one test (F P R ≤ 0.05). B) Distribution of KSTAR activity scores for all kinases at each evidence size. Box indicates median (center line), 25th and 75th percentiles (box boundaries), 1.5x the IQR of the box edge (whiskers), and any outliers beyond 1.5x IQR (points). If no outliers exist, whiskers indicate maxima or minima. For each condition, statistics were derived from all serine/threonine kinases with activity in at least one condition in KSTAR (n = 55). C) Histograms demonstrating the distribution of activity scores when 100, 2000, 6000, or 10000 sites are used for prediction. D) Correlation of activity rankings between predictions of different evidence sizes. Source data are provided as a separate Source Data file with this paper.

5/71
Supplementary Note 2: Controlling for kinase-and experiment-specific false positive rates Goal 1. Assess the bias found within different phosphoproteomic databases by predicting activity from random samplings of the database 2. Assess the bias found with phosphoproteomic datasets by looking at the distribution of study bias across the sites identified in an experiment 3. Assess the relationship between study bias of a substrate and quantified log fold changes

Methods
In order to measure the false positive rate, we randomly created 200 random datasets, each composed of 250 randomly selected phosphotyrosine sites using different phosphoproteomic databases as the background. We set the desired false positive rate at 0.05 and therefore expect, on average, about 10 positives per kinase across all datasets. The false positive results are from KSTAR networks before the final addition of controlling for the distribution of compendia was added (Supplementary .The size of the dataset was selected based on common dataset sizes for phosphotyrosines and this value is much smaller than the smallest of the compendia background.

Summary of Results
The first four figures here have been ordered from backgrounds that produce the highest false positive rates to those that produce the lowest false positive rates. It became clear that the kinases with high false positive rates were consistent across all compendia (except those from PhosphoSitePlus, which is the largest and predominantly comprised of mass spectrometry identified sites). In short, FYN, LCK, and HCK demonstrated significant levels (100%) false positive rates. On the other end of the range, it became clear that most kinases were producing less than expected false positive rates (0%), which suggested that it is also more difficult to yield true positives for these kinases. Additionally, we noticed a trend that the false positive rates were highest from the compendia that overlapped most with the genesis of the NetworKIN prediction models (Phospho.ELM [5], Supplementary Figure 4) and lowest from the compendia most separated from training data that formed the networks (PhosphoSitePlus [1], Supplementary Figure 7). We therefore hypothesized that there was a direct connection between the study bias of the phosphorylation sites (the more compendia a site is in the more likely it is annotated and was used in training the networks). Supplementary Figure 8 shows that our original KSTAR networks that generated such kinase-specific false positive rates indeed showed skew for more studied substrates connected to the kinases yielding high false positive rates. We next asked if we could measure the false positive rate for random experiments by random draws from the human phosphoproteome. We found that all experiments analyzed in this manuscript were not representative of random samples from the phosphoproteome, with respect to how many different compendia each phosphorylation site is stored (Supplementary Figure 10). We also found that magnitude of fold changes (i.e. quantification) was partially related to the study bias of the individual substrate (Supplementary Figure 11). This suggests that each experiment will have different false positive rates, influenced directly by how well studied the sites are within the dataset. This motivated the final mitigation strategy of selecting random datasets for measuring empirical false positive rates based on sampling sites from the phosphoproteome, such thatt he distribution matches that of the real experiment. The combination of avoiding quantification, normalizing kinase study bias distribution (Supplementary Figure 9), and sampling random sets to control for experiment-specific distributions, helps to control the false positive rate as desired in a kinaseand experiment-specific manner. Supplementary Figure 6. Kinase-Specific False Positive Rates in dbPTM To assess false positive rates obtained from dbPTM [7], 200 random datasets were created by randomly sampling 250 phosphotyrosine sites from dbPTM, and KSTAR was applied (without accounting for study bias, i.e. no study bias constraint in heuristic prune or comparison to random activities from Mann Whitney U test) to generate activity predictions for each random dataset. Each column in the above dotplot corresponds to results from a different random datset. The size of each dot corresponds to the median p-value enrichment obtained across all pruned networks, and is colored based on significance (p <= 0.05). Predictions exhibit lower false positve rates than when sampling from HPRD [6] or Phospho.ELM [5], but still see high kinase-specific false-positive rates for kinases like FYN, LCK, and HCK, and sees more kinases with 0 positives across all datasets. Source data are provided as a separate Source Data file with this paper  Figure 9. Distribution of study bias in final networks We added the constraint that all kinases should be connected to substrates with an equal distribution of study bias, as defined by the number compendia they are observed in. This ensures that no kinase will be connected to more well studied sites than any other kinase, which leads to kinase-specific false positive rates. This leads to a more balanced network, as observed in the above plot indicating the average number of substrates found in each study bias grouping (found in 0, 1, 2, 3, 4, or 5 compendia) across the final KSTAR generated networks (with a study bias constraint, Algorithm 2 in Supplementary Methods).

B) Distribution of study bias in ProteomeScout A) Distribution of study bias in example dataset
Supplementary Figure 10. Distribution of study bias in a real dataset, compared to whole phosphoproteome We found that the phosphorylation sites from experiments were much more likely to contain sites that are well annotated than expected by random chance (i.e. contributing to experimentspecific study bias). A) This is the histogram of the number of compendia phosphorylation sites are observed in for the phosphotyrosines of the PDX dataset from Huang et al. [8]. B) This is the distribution for all phosphotyrosines in the human phosphoproteome, which shows the majority of sites would be annotated by only one compendia (most likely, these are annotated only in PhosphoSitePlus). The three lines represent our final classes of study bias for low (0 compendia), medium (1-2 compendia), and high (more than 3 compendia). This classification was defined and published in our prior work [9]. Source data are provided as a separate Source Data file with this paper.
13/71 a b provided as a separate Source Data file with this paper.
(box edges), and 1.5x the IQR of the box edge (whiskers). Statistical significance was assessed using a one-tailed Mann Whitney U test and p-values indicated on the plot if p < 0.4. We did not find as strong a relationship between study bias and quantification in serine/threonine datasets. Source data are Supplementary Figure 11. Relationship between study bias and quantification in the benchmarking dataset In addition to the likelihood of identification discussed in Supplementary Figure 10, we explored whether quantification and study bias are related by looking at the distribution of fold change magnitudes as a function of study bias. We have defined three classes of study bias based on the number of compendia a site is identified in: low (0 compendia), medium (1-2 compendia), and high (more than 3 compendia). This classification was defined and published in our prior work, KinPred [9]. A) Distribution of log2fold changes for phosphorylated tyrosine sites based on degree of study bias. Quantification was obtained from the tyrosine kinase benchmarking dataset described in Supplementary Table 3 (20 independent experiments, Low: n = 253 sites, Medium: n = 7188 sites, High: n = 10,464).Violin plots indicate the median (white dot), 25th and 75th percentile (box edges), and 1.5x the IQR of the box edge (whiskers). Statistical significance was assessed using a one-tailed Mann Whitney U test and associated p-values are indicated on the plot. We found that well studied tyrosine sites identified in an experiment tend to exhibit larger log2 fold changes than less well studied sites. B) Distribution of log2fold changes for phosphorylated serine or threonine sites based on degree of study bias. Quantification was obtained from the serine/threonine kinase benchmarking dataset described in Supplementary Table 4

Methods
For all inhibition and activation datasets where KSTAR was applied ( Figure 2 in the main text and Supplementary Table 2), a dotplot is provided that includes all kinases with predictions (limited by the kinases with substrate predictions in NetworKIN). For the two serine/threonine datasets ( Figure 5, 6), only kinases with significant activity in at least one sample (F P R ≤ 0.05) were included in the plot due to space constraints.

Table of Contents
Supplementary Figure   Supplementary Figure 13. EGF/HRG stimulation of 184A1 epithelial cells overexpressing HER2 Full KSTAR predictions corresponding to Figure 2A of the main text, with phosphoproteomic data obtained from Wolf-Yadlin et al. [11]. Epithelial cells expressing normal HER2 levels (Parental, P) or overexpressing HER2 (24H) were stimulated with EGF or HRG and phosphorylation was measured at 0, 5, 10 and 30 minutes. For each condition, sites with abundance ratios greater than 0.8 were used as evidence, where ratios were relative to the Parental, 5 minute EGF condition (P EGF 5(min)). Kinase were sorted using hierarchical clustering with Ward linkages. Parental cell stimulated with EGF corresponds to the same culture conditions measured in Supplementary Figure 12. Source data are provided as a separate Source Data file with this paper.  Figure 2B based on data from Chylek et al. [12]. Jurkat cells were stimulated to activate T-cell receptor signaling and phosphorylation was measured at 0, 15, 30, and 60 seconds. For each condition, sites with abundance ratios greater than 0.2 were used as evidence, where ratios were relative to the 0 minute timepoint. Kinases were sorted using hierarchical clustering with Ward linkages. Source data are provided as a separate Source Data file with this paper.  Figure 15. BCR-ABL inhibition by dasatanib Full KSTAR predicitions corresponding to Figure 2C of the main text, based on data from Asmussen et al. [13]. K562 chronic myeloid leukemia (CML) cell line, which contains the BCR-ABL fusion protein, was treated with dasatanib, an ABL inhibitor, for 20 minutes prior to drug washout. PRE refers to pre-treatment, EOE refers to the endof treatment, HDP3 refers to 3 hours post drug washout, and HDP6 refers to 6 hours post drug washout. For each condition, sites with abundance ratios greater than 0.5 were used as evidence, where ratios were relative to the 0 minute timepoint. Kinases were sorted using hierarchical clustering with Ward linkages.  [4]. BT-474 breast cancer cells were treated with one of five different inhibitors (GSK2110183, Ipatasertib, AZD5363, GSK690693, MK-2206). All of the inhibitors work by directly binding the ATP binding pocket except for MK-2206, which is an allosteric inhibitor. For each inhibitor condition, sites with abundance ratios greater than or equal to 1 were used as evidence, relative to pre-treatment contol. Kinases and conditions were sorted using hiearchical clustering with Ward linkages. Source data are provided as a separate Source Data file with this paper.  Figure 2F of the main text, based on data from Kubiniok et al. [14]. Colorectal cancer cell lines Colo205 (BRAF V 600E mutation) or HCT116 (KRAS mutation) were treated with the BRAF inhibitor, vemurafenib, over the course of 60 minutes. In cells containing KRAS mutations, BRAF inhibition leads to the activation of the MAP-ERK pathway rather than inhibiting the pathway, as is the case for cells with a BRAF V 600E mutation. For each timepoint/cell line, sites with abundance greater than median observed abundance were used as evidence. Kinases were sorted using hierarchical clustering with Ward linkages. Source data are provided as a separate Source Data file with this paper.
Supplementary Figure 18. MAPK activity response to BRAF inhibition Predicted MAPK activity after performing quantile normalization across conditions on the original KSTAR activities corresponding to Figure 2E of the main text and Supplementary Figure 17. This demonstrates that despite statistical saturation issues in serine/threonine networks where MAPK/CDK activity is often high, the expected trend is observed (increased MAPK activity in HCT116 cells, decreased MAPK activity in Colo205 cells). Source data are provided as a separate Source Data file with this paper.

22/71
Supplementary Note 4: Comparing KSTAR to other available activity inference algorithms Goals 1. Compare the usability and interpretability of various kinase activity inference methods.
2. Compile datasets for use in comprehensive benchmarking of activity inference methods 3. Compare the accuracy of KSTAR to other kinase activity inference methods for both serine/threonine kinases and tyrosine kinases 4. Assess the robustness of algorithms to data loss and the influence of well studied sites on final predictions generating predictions, either by random removal or targeted removal based on the degree of study bias . Starting with 5% loss and continuing to 50% loss (at increments of 5%) ,the given percent of sites were removed and predictions were regenerated as normal. We then looked at the change in false discovery rate for each prediction, and how this differed between the random and targeted attack. To quantify the differences between these two curves, we defined two metrics: 1) sensitivity to data loss, which is the area under the random attack curve, and 2) sensitivity to study bias, which is the area between the targeted and random attack curve. Given that we are looking at changes to significance of prediction, only KSTAR, KSEA, and PTM-SEA were assessed (KARP and KEA3 do not provide significance of prediction).

Methods
As described in the main text, datasets used in the benchmarking analysis were collected from 16 different publications ( Figure 20). KSTAR, KSEA [26], PTM-SEA [27], KARP [28], and KEA3 [29] were all used to generate predictions about the most enriched/differentially active kinases across the benchmarking dataset (Supplementary Figure 19). Accuracy was calculated based on P hit , defined as the fraction of times a kinase expected to perturbed was identified as differentially active, either based on kinase rank or significance. In addition, we looked at the total number of kinases with available predictions for each test condition, per algorithm ( Supplementary Figures 22 and 23).
In addition to accuracy, we also assessed the impact of losing specific sites from the dataset when

Table of Contents
Supplementary Figure Figure 19. Interpreting kinase activity inference methods. Kinase activity algorithms differ both in the type of data that can be used and in the way that their output should be interpreted. KSTAR, KARP, and KEA3 are all algorithms that can be used in single sample settings, where differential abundances are not available and an activity score/rank is generated for each sample. While KARP and KSTAR were explicitly designed for use with single sample experiments, KEA3 was largely designed for differential setttings (all three can be used in differential settings as well). On the other hand, KSEA and PTM-SEA require differential abundances. Of the single sample approaches described here, only KSTAR provides both an indicator of the degree of activity and the associated significance of activity. Finally, only KSTAR and KEA3 do not rely on quantification for prediction. (caption continued on next page) C,D) To avoid certain kinases exhibiting large influence on P hit results, we collapsed each kinase into a single accuracy score (P hit,k ) and then took the average of each kinase-specific score to obtain the final P hit . The impact of this approach can be seen for tyrosine kinases (C) and serine/threonine kinases (D). E,F) Distribution of tyrosine and serine/threonine kinases. We found that there was still an overrepresentation of serine/threonine kinases in the dataset both before (E) and after (F) equal weighting was applied. As a result, we separated serine/threonine and tyrosine kinases for benchmarking purposes. a b c d (differential activity has FDR ≤ 0.05). KARP and KEA3 were only assessed using the rank-based metric. Not all algorithms generated predictions for all kinases, indicated by the light purple squares. A) Tyrosine kinase prediction accuracy based on rank. B) Tyrosine kinase prediction accuracy based on significance. C) Serine/Threonine kinase prediction accuracy based on rank. D) Serine/Threonine kinase prediction accuracy based on significance. Source data are provided as a separate Source Data file with this paper.
Supplementary Figure 21. Full results obtained from benchmarking analysis. Kinase-specific accuracy scores (P hit,k ) for all kinases perturbed in the benchmarking dataset. All results seen here were used to calculate the final P hit scores in the barplots from Figure  b a requirement is enforced. Box indicates median (center line), 25th and 75th percentiles (box boundaries), 1.5x the IQR of the box edge (whiskers), and any outliers beyond 1.5x IQR (points). If no outliers exist, whiskers indicate maxima or minima. There is a loss of available kinase predictions for serine/threonine kinases when implementing substrate requirements for KSEA or PTM-SEA. Source data are provided as a separate Source Data file with this paper.
(box boundaries), 1.5x the IQR of the box edge (whiskers), and any outliers beyond 1.5x IQR (points). If no outliers exist, whiskers indicate maxima or minima. There is a loss of available kinase predictions for tyrosine kinases when implementing substrate requirements for KSEA or PTM-SEA. While KARP does not explicitly define a substrate requirement, it also relies on PhosphoSitePlus annotations, so would be similarly affected by such a requirement as KSEA. B) Distribution of the number of serine/threonine kinases with a prediction from KSEA or PTM-SEA across the benchmarking dataset (described in Supplementary Table 4, n = 33 biologically independent experiments), based on whether a substrate Supplementary Figure 23. Applying substrate requirements significantly reduces overall kinome coverage of many activity inference algorithms. For algorithms relying on kinase-substrate annotation, it is common to restrict analyses to kinases with a set number of substrates. For KSEA, which relies on annotation from PhosphoSitePlus, analysis is restricted to kinases with at least five substrates by default. PTM-SEA, which relies on PTMsigDB, restricts analysis to kinases with at least 10 substrates by default. To determine the impact of these requirements, we asked how many kinases had available predictions for each condition in the benchmarking dataset. We found that, in both cases, the substrate requirement significantly reduced the number of available kinase predictions. For tyrosine kinases, there were several cases where less than 5 kinases had available predictions for a given condition. While these predictions are likely to be more robust as more data is associated with each prediction, the lack of kinome coverage significantly reduces the utility of these algorithms. Further, kinases that tend to be connected to well-studied substrates will be more likely to have available predictions, as well studied sites are more likely to be seen in any given experiment (Supplementary Note 2). For that reason, all accuracy metrics used in benchmarking were calculated using results without this requirement. A) Distribution of the number of tyrosine kinases with a prediction from KSEA or PTM-SEA across the benchmarking dataset (described in Supplementary Table 3 To determine whether the general pruning procedure described in this work could also improve the prediction ability of other algorithms, we applied KSEA [26] to the benchmarking dataset using different kinase-substrate networks for A) Tyrosine kinases or B) serine/threonine kinases. Kinase-substrate information was obtained from three different sources: 1) Annotations: known kinasesubstrate interactions stored in PhosphoSitePlus (most commonly used, approach used in the main text), 2) Thresholded: predictions from a NetworKIN graph thresholded with a value of 2 (default in the KSEA web app [30]), and 3) Pruned: KSTAR networks generated through a heuristic prune of the weighted NetworKIN graph (described in Figure 1 of the main text and Supplementary Methods). For the pruned networks, accuracy was either calculated for a single pruned network or based on the median scores/p-values for each kinase across the 50 networks. The former is reported as the median accuracy across the 50 networks, with the accuracy obtained from each single network indicated by the black points. For reference, we also included the accuracy of KSTAR in each plot, which matches Figure 3A of the main text. Source data are provided as a separate Source Data file with this paper.

a) EGFR Stimulation b) ATR inhibition
are provided as a separate Source Data file with this paper.
Wolf-Yadlin et al. [10], which is experiment #4 from Supplementary  The average random and targeted loss curves for each tyrosine kinase tested in the data loss experiment (corresponds to Figure 4 of the main text). Rows correspond to a specific kinase, columns correspond to a specific algorithm. The x-axis indicates increasing data loss from 0 to 50%, and the y-axis indicates the false discovery rate of the predicted activity for the kinase of interest, ranging from 0 to 1. The average sensitivity to data loss (blue) and study bias (green) are displayed in the upper left of each plot (these values correspond to heatmap in Supplementary Figure 21). In cases where the curve stops before reaching 50% data loss (indicated by a black point at the stop point), this occurs because there is no longer any available substrates in the data that can be used for prediction in either PhosphoSitePlus [  Supplementary Figure 27. Serine/threonine kinase predictions lose significance with increasing data loss. The average random or targeted loss curves for each serine/threonine kinase tested in the data loss experiment (corresponds to Figure 4 of the main text). Rows correspond to a specific kinase, columns correspond to a specific algorithm. The x-axis of each plot indicates increasing data loss from 0-50%, and the y-axis indicates the false discovery rate of the predicted activity for the kinase of interest, ranging from 0 to 1. The average sensitivity to data loss (blue) and study bias (green) are displayed in the upper left of each plot (these values correspond to heatmap in Supplementary Figure 21). In cases where the curve stops before reaching 50% data loss, this occurs because there is no longer any available substrates in the data that can be used for prediction in either PhosphoSitePlus [1] (for KSEA) or PTMsigDB [27](for PTM-SEA). Source data are provided as a separate Source Data file with this paper.

a) b)
Supplementary Figure 28. Kinase-specific sensitivity to data loss and study bias for different activity inference methods. Heatmaps indicating the average sensitivity to data loss and study bias for individual kinases, rather than the whole benchmarking dataset. As described in the main text, "sensitivity to data loss" is defined as the area under the random attack curve, while "sensitivity to study bias" is defined as the difference in the area under the curve between the targeted and random attack. Cells in the heatmap that are gray indicate that predictions were either not available or were not found to be significant in the full dataset. We found that in addition to KSTAR being generally less sensitive to both random and targeted attacks, certain kinase predictions tended to suffer more from targeted attack (removal of well-studied sites), such as KDR (also known as VEGFR2), EPHA2, and MAP2K1 .A) Tyrosine kinases B) Serine/Threonine Kinases. Source data are provided as a separate Source Data file with this paper.

34/71
Supplementary Note 5: Robustness analysis comparing NSCLC and CML cell lines from independent experiments Goal Apply KSTAR and KEA3 [29] predictions to multiple different datasets from different labs profiling non small cell lung carcinoma (NSCLC) and/or chronic myeloid leukemia (CML) cell lines to determine if predictions can identify tissue similarities between datasets even when the identified sites differ. These results correspond to Figure 5 in the body of the paper.

Methods
A total of 11 datasets from 7 studies and 5 labs were obtained from the original publications (described in Supplementary  [13]. For each dataset, only data corresponding to an untreated cell line was utilized. To generate predictions for each dataset, any site identified in the experiment, regardless of quantification, was utilized as evidence. In KSTAR, the phophorylated sites were inputted into the algorithm to produce activity scores and false positive rates. In KEA3, the gene names associated with each phosphorylated site were extracted and inputted into the KEA3 API in python to generate integrated mean ranks, where low mean ranks indicate that the kinase was found to be one of the more enriched kinases in many different protein-protein interaction databases. Spearman rank correlation was used (either with KSTAR activity scores or KEA3 mean ranks) to assess the similarity of predictions across datasets.

Table of Contents
Supplementary Figure  Heat cells surrounded by a black box indicate that datasets were obtained from the same study. Overall, this demonstrates that KEA3 was unable to clearly differentiate between the tissues across datasets, with most predictions exhibiting high correlation between each other, regardless of tissue type. Source data are provided as a separate Source Data file with this paper.

37/71
Supplementary Note 6: Full analysis results of breast cancer phosphoproteomic datasets Goal 1. Assess the ability of kinase activity inference approaches to predict treatment response in the clinic, particularly in the context of HER2 activity in breast cancer 2. Compare the efficacy of KSTAR, KEA3, and KSEA in the clinical setting Methods In this supplement, we have provided the KSTAR predictions for all kinase and samples discussed in Figure 6 of the main text, focused on predicting kinase activity in breast tumor biopsies. For the microscaled biopsies of HER2+ patients [36], we have included multiple different dotplots to seperate between the pre-and post-treatment data and provide the predictions for serine/threonine kinases in pre-treatment samples (to illustrate that serine/threonine kinase activity profiles do not appear to differentiate between responders and nonresponders). In addition, where relevant, we have compared our KSTAR results to the results obtained by either KSEA (Supplementary Figure 32) or KEA3 (Supplementary Figures 32, 38, and 40). We found that KSEA results from the tumor biopsies tended to be nonrobust and difficult to apply in this setting (due to low numbers of known substrates and reliance on quantification), so we did not continue use of KSEA after application to the CPTAC dataset [37]. To make KEA3 more directly comparable to KSTAR results, the list of kinases with predictions from KEA3 was reduced to include only kinases that also have predictions in KSTAR. As all predictions are based on the abundance of phosphorylated tyrosines (except in Supplementary Figure 36), this also had the effect of removing serine/threonine kinases from KEA3 predictions, which are not relevant to the actual input data. The KEA3 rankings displayed throughout this supplement are then obtained based on the KEA3 mean ranks, where the 10th ranked kinase had the 10th smallest mean rank across the 50 tyrosine kinases with predictions in KSTAR.

Table of Contents
Supplementary  (-log10(p)), and dots are colored based on if the observed activity score has a false positive rate below 0.05. Kinases and samples were sorted using hierarchical clustering with ward linkage. Important clinical attributes of each patient can be found above the main dotplot, including the HER2 status and classification based on either mRNA or protein expression data. Source data are provided as a separate  Figure 6A. B) KEA3 ERBB2/HER2 ranking, relative to the 50 kinases that also have predictions in KSTAR. So, for a sample where the rank is 10, ERBB2/HER2 is the 10th most enriched kinase out of 50 other tyrosine kinases. C) KSEA z-scores and number of identified ERBB2 substrates for each CPTAC patient sample. In patients with at least one ERBB2/HER2 substrate, z-score enrichment was calculated with measured log2 transformed abundances. Positive z-scores indicate high ERBB2 activity and negative z-scores indicate low ERBB2 activity, relative to a pooled sample. Empty bars correspond to patients with no identified substrates, as indicated by PhosphoSitePlus [1]. The final enrichment score is equal to the maximum deviation from0 obtained in the running sum. Statistical significance was then calculated using a null distributionconsisting of ES scores obtained when the sample list was sorted randomly. Only KSTAR ERBB2 activitypredictions were found to be significantly correlated with HER2 status. E) HER2 status predictions for KSTAR, with active ERBB2/HER2 defined by score ≤ 1e − 3. Same table as in Figure 4A. F) HER2 status predictions for KEA3, defining active ERBB2/HER2 by an ERBB2 rank of 8 or higher. This cutoff was determined by identifying the rank cutoff with the best overall F1 score for KEA3. Source data are provided as a separate Source Data file with this paper.  Supplementary Figure 34. Full KSTAR tyrosine kinase activity predictions on microscaled biopsies Full dotplot containing all tyrosine kinase activity predictions on Microscaled Biopsies of HER2+ patients for both pre-and post-treatment [36]. Corresponds to Figure 6C. Patients were treated with a combination of chemotherapy and HER2 targeted therapy, and post-treatment biopsies were taken 48-72 hours following the beginning of treatment. Dot size corresponds to the activity score (-log10(p)), and dots are colored based on if the observed activity score has a false positive rate below 0.05. Kinases and samples were sorted using hierarchical clustering with ward linkage. Treatment status (pre-vs. post-treatment) and treatment response (responder vs. non-responder) of each sample are indicated with the context dots above the dotplot. Source data are provided as a separate Source Data file with this paper.  Figure 35. KSTAR tyrosine kinase activity predictions on microscaled biopsies prior to treatment Full dotplot containing all tyrosine kinase activity predictions on microscaled biopsies of HER2+ patients for only pre-treatment from Satpathy et al. [36]. Corresponds to Figure  6C. Dot size corresponds to the activity score (-log10(p)), and dots are colored based on if the observed activity score has a false positive rate below 0.05. Kinases and samples were sorted using hierarchical clustering with ward linkage. Treatment response to combination therapy of chemotherapy and anti-HER2 therapy (responder/pCR vs. non-responder/non-pCR) of each sample are indicated with the context dots above the dotplot. Source data are provided as a separate Source Data file with this paper.

43/71
ORJSYDOXH Supplementary Figure 36. KSTAR serine/threonine kinase activity predictions on microscaled biopsies prior to treatment Full dotplot containing all serine/threonine kinase activity predictions (for which at least one sample patient had on microscaled biopsies of HER2+ patients for only pre-treatment from Satpathy et al. [36]. Dot size corresponds to the activity score (-log10(p)), and dots are colored based on if the observed activity score has a false positive rate below 0.05. Kinases and samples were sorted using hierarchical clustering with ward linkage. Treatment response to combination therapy of chemotherapy and anti-HER2 therapy (responder/pCR vs. non-responder/non-pCR) of each sample are indicated with the context dots above the dotplot. Source data are provided as a separate Source Data file with this paper.

44/71
a c e d b Supplementary Figure 37. Patient-specific KSTAR predictions for non-pathologically complete responders (non-pCR) Patient specific dotplots of EGFR and ERBB2 KSTAR activity predictions for non-pathologically complete responders (non-pCR) to treatment with chemotherapy and anti-HER2 therapy from Satpathy et al. [36]. Each plot includes all available replicates and both pre/posttreatment samples. A) Patient BCN1326, a false positive without ERBB2 copy number amplification. B) Patient BCN1331, a pseudo-false positive with ERBB2 amplification but without increased protein expression. C) Patient BCN1335, a pseudo-false positive with ERBB2 amplification but without increased protein expression. Post-treatment data was not available. D) Patient BCN1369, who failed to respond to treatment despite having ERBB2 amplification and increased protein expression. E) Patient BCN1371, who failed to respond to treatment despite having ERBB2 amplification and increased protein expression. Post treatment data was not available. Source data are provided as a separate Source Data file with this paper.  Supplementary Figure 38. Patient-specific KEA3 ranks for non-pathologically complete responders (non-pCR) Patient specific dotplots of EGFR and ERBB2 KEA3 ranks for non-pathologically complete responders to treatment with chemotherapy and anti-HER2 therapy from Satpathy et al. [36]. Ranks are relative to the 50 tyrosine kinases found in NetworKIN in order to make ranks comparable to KSTAR. Each plot includes all available replicates and both pre/post-treatment samples. As all but one true HER2+ sample (not identified as a false positive by Satpathy et al. [36] had ERBB2 as one of the top 4 most active kinases based on KSTAR predictions, we have applied a significance cutoff of rank 4 (rank is 4 or better, kinase is deemed significantly active). See Supplementary Figure 41 for more details on this cutoff choice. A) Patient BCN1326, a false positive without ERBB2 copy number amplification. B) Patient BCN1331, a pseudo-false positive with ERBB2 amplification but without increased protein expression. C) Patient BCN1335, a pseudo-false positive with ERBB2 amplification but without increased protein expression. Post-treatment data was not available.  Figure 39. Patient-specific KSTAR predictions for pathologically complete responders (pCR) Patient specific dotplots of EGFR and ERBB2 KSTAR activity predictions for pathologically complete responders (pCR) to treatment with chemotherapy and anti-HER2 therapy from Satpathy et al. [36]. Each plot includes all available replicates and both pre/post-treatment samples. A) Patient BCN1300. This is the only pCR patient who had basal ERBB2 activity but did not see activity decrease post-treatment, indicating that ERBB2 therapy arm was not responsible for successful response.  Figure 40. Patient-specific KEA3 ranks for pathologically complete responders (pCR) Patient specific dotplots of EGFR and ERBB2 KEA3 ranks for non-pathologically complete responders to treatment with chemotherapy and anti-HER2 therapy from Satpathy et al. [36]. Ranks are relative to the 50 tyrosine kinases found in NetworKIN in order to make ranks comparable to KSTAR. Each plot includes all available replicates and both pre/post-treatment samples. As all but one true HER2+ sample (not identified as a false positive by Satpathy et al [36] had ERBB2 as one of the top 4 most active kinases based on KSTAR predictions, we have applied a significance cutoff of rank 4 (rank is 4 or better, kinase is deemed significantly active). See Supplementary Figure 41 for more details on this cutoff choice. A) Patient BCN1300. This is the only pCR patient who had basal ERBB2 activity but did not see activity decrease post-treatment, indicating that ERBB2 therapy arm was not responsible for successful response.

Supplementary Tables
Supplementary Table 1. Commonly used and available kinase activity inference methods We have compiled information regarding several popular/commonly referenced kinase activity/enrichment inference methods currently available. Included is a brief description of their implementation, as well as the input data type, background networks used, whether they use quantification, and if they account for study bias.  Figure 2 For each of the datasets used in Figure 2 of the main text, we have provided a description of the original experiment and how it was used in KSTAR analyses, including the threshold used to define binary evidence within them and the resulting evidence sizes.  [11] HER2overexpression_ WolfYadlin2006

53/71
Location of Data in Source Publication Supplementary Table 3. Phosphoproteomic datasets used in the tyrosine kinase benchmarking dataset Information about each study/condition used to benchmarking tyrosine kinase predictions, including the kinases considered to be perturbed. We have indicated the file name/location of the source data in the original publication from which the data was obtained. In cases where chemoproteomic information was available, binding information was used to determine the top 5 kinases most likely to be perturbed by drug.  [16] AURK_CellCycleStudy _ Kellenbach Table S1 No HeLa cells arrested in mitosis and treated with 0.25uM MLN8054 AURKA Down 28 [16] AURK_CellCycleStudy _ Kellenbach Table S1 No HeLa cells arrested in mitosis and treated with 1uM MLN8054 AURKA Down 29 [16] AURK_CellCycleStudy _ Kellenbach Table S1 No HeLa cells arrested in mitosis and treated with 5uM MLN8054 AURKA,AURKB Down 30 [16] AURK_CellCycleStudy _ Kellenbach Table S1 No Hela cells arrested in mitosis and treated with AZD1152 AURKB Down 31 [16] AURK_CellCycleStudy _ Kellenbach Table S1 No HeLa cells arrested in mitosis and treated with BI2536 PLK1,PLK2,PLK3 Down 32 [17] CDK8/19_inhibition_ Poss Table S1 No HCT116 cells treated with cortistatinA CDK8,CDK19 Down 33 [18] CK2_inhibition_ Franchin  Tables S1-S6  No  HEK293T cells treated with  quinalizarin for 3 [20] CDK1_inhib_ Petrone Table S1 No HeLa cells treated with Flavopiridol, a CDK1 inhibitor CDK1 Down 43 [20] CDK1_inhib_ Petrone  [23] DNA_damage_Beli Table S1 No U2OS cells treated with epotoside for 24 hours ATR,ATM Up 51 [23] DNA_damage_Beli Table S1 No U2OS cells treated with ionizing radiation for 1 hour ATR,ATM Up

55/71
Supplementary Table 5. Phosphoproteomic datasets used for robustness analysis in Figure 5 For each of the datasets used in Figure  5 of the main text, we have provided a description of the original experiment, the conditions/cell lines we used from this experiment, and the phosphorylation sites used for prediction

56/71
Supplementary Table 6. Breast cancer datasets used in Figure 6 For each of the datasets used in Figure 6 of the main text, we have provided a description of the original experiment and how it was used for KSTAR analysis  (Figures 2-5), we have included information regarding the ancestry and demographic informaiton of the cell lines used. All information was obtained from Cellosaurus [38]. For patient focused demographic information, please reference the original publications ( Figure 6A: Supplementary Table 1 in Mertins et al. [37], Figure 6B: Supplementary Data 1 in Huang et al. [8], Figure 6C: Supplementary Table1B from Satpathy et al. [36]).

a b
Problem 2: Thresholded kinase-substrate prediction networks have been shown to exhibit poor performance relative to using kinase-substrate annotations alone.

Solution:
We hypothesized there is useful information in these kinase-substrate networks, but that the single kinase-substrate network generated by thresholding contains too many incorrect predictions and issues with study bias to be useful. Instead, we proposed that we could create an ensemble of many possible kinase-substrate networks using the network edge weights to guide edge selection, rather than relying on single representation of kinase-substrate relationships (heuristic prune, Algorithm 2). This approach is built on the idea that all networks are wrong, but they are wrong in different ways. While any one single network is unlikely to be the correct representation, aggregating information across these networks will allow us to converge on the kinases most likely to be active. We have demonstrated the utility of ensemble approaches in some of our prior work, such as for clustering with OpenEnsembles [39].
Supplementary Figure 42. KSTAR Expands the Number of Phosphorylation Sites Used in Prediction A) The total number of unique substrates with a edge/interaction with at least one kinase for known kinase-substrate annotations in PhosphoSitePlus [1], or predicted interactions from at hresholded NetworKIN [2], or predicted interactions from KSTAR networks. B) The same information as A, but provided as a fraction of the total phosphoproteome. Source data are provided as a separate Source Data file with this paper.
Problem 3: Thresholded kinase-substrate prediction networks result in the emergence of "Hub" substrates, which provide evidence for a disproportionate number of kinases. Solution: During the heuristic prune, we limit the number of edges each substrate can have in each network (number of kinases it can provide evidence for).

Weighted Edges
Thresholded Edges a b c Problem 4: Thresholded kinase-substrate prediction networks result in the emergence of "Hub" kinases, which are connected to a disproportionate number of substrates. In addition, many kinases often lose the majority of their edges in the network, losing the ability to generate predictions for these kinases. Solution: During the heuristic prune, force all kinases to have the same number of edges in each network, ensuring that no one kinase has too few/too many substrates providing evidence for them.
Problem 5: Kinase-substrate prediction networks exhibit high study bias, where certain kinases are more likely to be connected to well-studied sites, which leads to kinase-specific false positive rates. Solution: Enforce a rule during the heuristic prune that ensures every kinase has edges with the same distribution of study bias, as defined by the number of phosphoproteome compendia they are identified in. This means that no one kinase is connected to more well studied sites (site found in most compendia) and no one kinase is connected to more poorly studied sites (site found in few compendia).

Pruned Edges
Supplementary Figure 43. Effect of Heuristic Prune on Hub Substrates in NetworKIN Here, we demonstrate the impact of the prune on substrate hubs, using PI3KR1 Y12 as an example (In network graphs, PI3KR1 Y12 is indicated by light blue node, dark blue nodes represent a unique kinase). A) Example hub substrate in NetworKIN thresholded with a value of 1 and B) the same substrate in two of the 50 pruned networks. C) Number of kinases each substrate is connected to in PhosphoSitePlus [1], NetworKIN thresholded with a value of 1 [2], and KSTAR pruned networks [PhosphoSitePlus: n = 1319 tyrosine substrates with at least one known interaction, NetworKIN: n = 6707 tyrosine substrates with at least one predicted interaction, KSTAR; n = 1752023 tyrosine substrates with at least one predicted interaction (only 39,268 unique substrates, each KSTAR network was counted separately to ensure that the properties/distribution of individual networks was illustrated)]. Box indicates median (center line), 25th and 75th percentiles (box boundaries), 1.5x the IQR of the box edge (whiskers), and any outliers beyond 1.5x IQR (points). If no outliers exist, whiskers indicate maxima or minima. Thresholded NetworKIN and pruned KSTAR networks exhibit similar distributions, where the median number of kinases each substrate is connected to is 2. However, KSTAR networks do not have any outlier substrates connected to more than 10 kinases ("Hub" substrates), which can have a large influence on the final predictions. Source data for c are provided as a separate Source Data file with this paper. a b c (certain kinases are connected to more substrates, or are connected to many well-studied sites), the heuristic prune enforces a rule that ensures all kinases have an equal number of edges in each pruned network, and that these edges have the same distribution of study bias for all kinases (as defined by the number of compendia each substrate is identified in). High study bias sites are found 3-5 compendia, medium study bias sites are found in 1-2 compendia, and low study bias sites are found in 0 compendia (only found in ProteomeScout [40]). A) Number of substrates each kinase is connected to in NetworKIN thresholded with a value of 1, and the distribution of study bias within these connections. In addition to certain kinases having significantly more predicted substrates in these networks, we discovered that certain kinases, such as FYN and SRC tended to be connected to more well studied sites. B) Number of substrates each kinase is connected to in a pruned network if no study bias constraint is applied (Box 1). While all kinases now have an equal number of edges, certain kinases tend to be more well connected to well studied sites (such as LCK, HCK, and FYN), leading to higher than expected, kinase-specific false positive rates. See Supplementary Note 2 for examples.C) Number of substrates each kinase is connected to in one of the pruned KSTAR networks when the study bias constraint is applied (Box 2). Unlike in B, all kinases have the same distribution of study bias among their substrate connections. This is an example of network used for the predictions in the main body of this work. Source data are provided as a separate Source Data file with this paper.
Supplementary Figure 44. KSTAR networks exhibit a more balanced distribution of kinase network edges In order to account for kinase-specific study bias issues in kinase-substrate network s Problem 6: In thresholded kinase-substrate prediction networks, kinases from the same family commonly exhibit high substrate overlap, where many of the same phosphorylation sites provide evidence for the same kinases. This reduces the ability of prediction algorithms to discriminate between the activity of these kinases.

Solution:
We have observed that the probability based selection of edges and the constraints placed on this selection (equal distribution of study bias, all kinases have the same number of edges, limits to the number of kinases a substrate can provide evidence for, etc.) during the heuristic prune procedure serve to significantly reduce the overlap between like kinases, allowing for greater discriminability in the activity of these kinases. Challenges with Kinase Activity Inference Problem 7: Quantification is required for the majority of currently available kinase activity inference algorithms. This is problematic for several reasons: 1. Relative quantification is inherently noisy, and a fold change can mean different things for different proteins. For example, a fold change of 2 can indicate an increase in abundance from 0.1 to 0.2, but it can also indicate an increase in abundance from 10 to 20, two very different outcomes.
2. More well studied sites tend to exhibit higher magnitude fold changes (See Supplementary Note 2) 3. In the clinical setting, it is often difficult to obtain a matched healthy sample required for relative quantification.
Supplementary Figure 45. Effect of Heuristic Prune on Network Similarity Between Kinases A) Heatmap depicting the overlapping substrates between kinases in NetworKIN when thresholded by a value of 1, as defined by the Jaccard similarity index of the sites providing evidence for each kinase. Kinases were sorted using hierarchical clustering. Key kinases/kinase families that exhibit high overlap are indicated by the boxes. B) Heatmap depicting the overlapping substrates between the same kinases in KSTAR networks, as defined by the average Jaccard Similarity across all KSTAR networks. Kinases were sorted using the same order in A. Source data are provided as a separate Source Data file with this paper.

63/71
Solution: Measured abundance values are converted to binary evidence based on some cutoff value relevant to the biological problem, and binary evidence is used as input into the KSTAR algorithm. We can also convert to binary evidence based only on whether a site is identified in a sample. This also allows for use of the KSTAR algorithm without relying on any quantification at all, using any sites identified in a sample as evidence, as done for Figure 5 in the main text.
Problem 8: Experiments are more likely to identify well-studied sites, which can lead to experiment-specific false positive rates (see Supplementary Note 2).
Solution: Compare enrichment p-values obtained from the real experiment to enrichment p-values that can be obtained from random phosphoproteomic experiments with the same properties as the real experiment (same number of sites used as evidence, same distribution of study bias). This is done using the Mann Whitney U-test, a non-parametric distribution test.

Implementing the KSTAR algorithm
In this section, we will expand upon our original description of the KSTAR algorithm and how it is implemented in practice. We have seperated the algorithm into two main sections for the purpose of clarity. First, we will describe the heuristic prune and how to generate the ensemble of possible kinase-substrate networks. For this section, the primary data that is required is a weighted kinase-substrate network containing predictions for the entire phosphoproteome, such as NetworKIN [2], GPS [41], or PhosphoPICK [42], and the reference human phosphoproteome obtained from KinPred [9]. The second section describes how phosphoproteomic data is converted into kinase activities using the network ensemble obtained in the first section.

The Heuristic Prune and the Generation of the Network Ensemble
The heuristic prune generates an ensemble of binary kinase-substrate networks representing possible representions of kinase-substrate interactions based on weighted graph predictions (we used NetworKIN, but theoretically can be any weighted kinase-substrate network). In this procedure, we ensure that all kinases have an equal number of edges/interactions in each binary network, set by kinase network size. We also ensured that no phosphorylation site provides evidence for more kinases than a predetermined cutoff, site limit. Lastly, we generate seperate networks for serine/threonine sites and tyrosine sites, as these are nonoverlapping networks. Once these parameters have been determined, edges are randomly selected from the weighted network and added to the pruned network according to a probability equal to edge weights in the weighted network. Importantly, each kinase in the network will have edges added at the same rate (all kinases will have one edge in the pruned network before any have two, etc.), which ensures that every kinase is most likely to have their highest probability edges present in the final networks. If any substrate in the pruned network obtains a number of edges equal to site limit, it is removed from the weighted network and cannot serve as evidence for any other kinases. Because edge selection is ultimately probabilistic in nature, every network generated using this procedure will be a unique representation of a possible kinase-substrate landscape. The implementation of this approach for a single network is described in Box 1.
We noticed that the above approach resulted in a high number of false positives for certain kinases and experiments (see Problem 5 and Supplementary Note 2), and determined that this was likely a result of the study bias found within these networks, where well studied substrates tend to have high influence on predictions. To account for study bias, we modified the approach described in Box 1 so that each kinase has edges/interactions with the same distribution of study bias (i.e. no kinases have only well-studied sites providing evidence for them, and no kinases have only poorly studied sites providing evidence for them). To do so, we defined study bias for each substrate in the human phosphoproteome as the number of different phosphoproteome compendia they are identified in. The compendia used in this study are PhosphoSitePlus [1], phosphoELM [5], HRPD [6], dbPTM [7], and UniProt [43].
The heuristic prune described in Box 2 is similar to the procedure in Box 1, except that each kinase in the final binary network contains an equal distribution of sites found in 0, 1, 2, 3, 4, or 5 compendia (Supplementary Figure 44). Box 2 illustrates is the prune implementation we utilize throughout the